DittoSeq is a package for visualization of single cell (or bulk) RNAseq data. In this tutorial, I will walk through using the package to explore some data available on the 10X genomics website. DittoSeq is a visualization package that does not handle the data processing side that is better handled already by others, so I have pre-processed the data we will use in Seurat. You can either:
install.packages("devtools")
devtools::install_github("dtm2451/DittoSeq")
pbmc <- readRDS("~/Downloads/pbmc.rds")
library(DittoSeq)
library(Seurat)
We can use either Seurat’s TSNEPlot or DBDimPlot to look at where distinct cells fall in tSNE space.
# Seurat:
TSNEPlot(pbmc, do.label = T)
# DBDimPlot
DBDimPlot("ident", object = "pbmc", do.label = T,
size = 2.5, do.letter = F, label.size = 3) #Inputs on this line are all optional, but do make this particular plot look better.
Notice that the biggest differences between the Seurat plot and DBDimPlot here are the colors and that DBDimPlot puts a box around the labels (by default at least) to make them easier to read.
# To remove those boxes and reveal all cells that might be hidden, use the input highlight.labels = F:
DBDimPlot("ident", object = "pbmc", do.label = T,
highlight.labels = F,
size = 2.5, do.letter = F, label.size = 3)
# To additionally use median-based ellipses to highlight where different groups mostly exist in the plot, you can add the ellipse = T input:
DBDimPlot("ident", object = "pbmc", do.label = T,
highlight.labels = F, ellipse = T,
size = 2.5, do.letter = F, label.size = 3)
# OR, to use lettering in order to make the idenitities more obvious instead of using on-plot labeling, utilize the do.letter input instead:
DBDimPlot("ident", "pbmc",
size = 2.5, do.letter = T, label.size = 3) # if do.letter is set to FALSE, feature is turned off. If do.letter is set to NA, it will turn on when the number of groups is >= 8.
The input reductions.use defaults to tsne for all Seurat objects, but can be set to any dimensional reduction that has been run on your dataset.
DBDimPlot("ident", "pbmc", reduction.use = "pca") # Because there are 8 groups, lettering actually turns on for this dataset by default, but we can deactivate that with do.letter=F as we did in the tsne lines above.
# With ellipse outlines of groups added and lettering off
DBDimPlot("ident", "pbmc", reduction.use = "pca",
ellipse = T, do.letter = F)
If you want to find out what types of dimensional reduction have been run on your dataset, you can use the following code:
names(pbmc@dr)
## [1] "pca" "tsne"
All outputs of this will work as inputs to DBDimPlot’s reductions.use.
There are many other options for customization of the DBDimPlot() function for when showing discrete data. You can use ?DBDimPlot to discover them all, but a few key inputs that I’d like to note:
main, sub, ylab, xlab, legend.title = these inputs can be used to manually set or remove particular titles.size = sets the size of the dots. Default = 1.shape = this can either be a number (which will change the shape of all dots to a different shape), or the quoted name of a meta.data slot that you would like to have the shapes of your data points depend one. For example, if shape = “BroadCelltype” were used after generating that metadata slot with code that is below, all the myeloid cells would have a different shape than the lymphoid cells.do.hover / data.hover = when do.hover is set to TRUE, the normal ggplot output is turned into a ggplotly plot instead and quoted inputs to data.hover are used to determine what pieces of data should be shown when the cursor is help over each cell. This is a function the Seurat TSNEPlot and PCAPlot functions can handle as well, and I thought it was important that I port it over!Okay, it looks like this dataset includes 8 different celltypes that have already been annotated. Now we want to explore the differences between these celltypes.
Beforehand, making a variable called DEFAULT and setting it to the “name.of.our.object” is a shortcut that will eliminate the need for adding object = "pbmc" to all of our DittoSeq code. So let’s do that!
#Set DEFAULT so we don't have to keep adding an object = "pbmc" input
DEFAULT <- "pbmc"
Also many DittoSeq visualizations allow for easy comparison accross samples in addition to clustering/celltypes. All that is needed is a meta.data slot that contains the grouping information. So lets create some mock “Samples” meta.data. For your dataset, you may already have something like this. Let’a also make meta.data slots for “Celltype” and “BroadCelltype”.
# The idents were assigned celltypes. Lets make a celltype metadata.
pbmc@meta.data$Celltype <- pbmc@ident
# Let's also make a BroadCelltype metadata, to give us another thing to group things by for playing.
pbmc@meta.data$BroadCelltype <-
#These lines will create the meta.data input which will be "Lymphoid" when the celltype is B, T, or NK cells, and "Myeloid" otherwise.
unlist(sapply(meta("Celltype"), function(X){
ifelse((X == "B cells" | X =="CD4 T cells" | X =="CD8 T cells" | X =="NK cells"),
"Lymphoid",
"Myeloid")}))
# And finally, let's make 3 mock samples groups
Sample <- rep("Sample1", length(pbmc@cell.names))
Sample[round((1/3*length(pbmc@cell.names)):round((2/3*length(pbmc@cell.names))))] <- "Sample2"
Sample[round((2/3*length(pbmc@cell.names)+1):length(pbmc@cell.names))] <- "Sample3"
pbmc@meta.data$Sample <- Sample
We can use get.metas() to get the names of all the metadata slots of an object.
get.metas()
## [1] "nGene" "nUMI" "orig.ident" "percent.mito"
## [5] "res.0.6" "Celltype" "BroadCelltype" "Sample"
And we can use the meta.levels() function to check that we successfully created the BroadCelltypes and Sample metadata slots:
meta.levels("BroadCelltype")
## [1] "Lymphoid" "Myeloid"
#Adding the input table.out = T makes it give counts of the different levels:
meta.levels("BroadCelltype", table.out = T)
##
## Lymphoid Myeloid
## 1956 682
#And for Sample:
meta.levels("Sample", table.out = T)
##
## Sample1 Sample2 Sample3
## 879 880 879
There are about equal numbers of cells in all the samples, as there should be for the 1/3s code used… so that worked.
For this specific purpose, you will want to use DittoSeq’s DBBarPlot() function. There is no Seurat counterpart for this role, other than a suggestion of using the table() function of base-R:
table(meta("Celltype"),meta("Sample"))
##
## Sample1 Sample2 Sample3
## CD4 T cells 384 412 355
## CD14+ Monocytes 147 168 164
## B cells 118 100 124
## CD8 T cells 104 92 112
## FCGR3A+ Monocytes 55 47 55
## NK cells 52 50 53
## Dendritic cells 13 8 11
## Megakaryocytes 6 3 5
For an easier to interpret visual representation, I present…
#####- My "identity plotting function, DBBarPlot -#######
# In it's simplest form, after DEFAULT <- "pbmc" has been set, DBBarPlot requires only 2 inputs:
DBBarPlot("Celltype", group.by = "Sample")
Now we can easily see that the most common celltype accross all samples is CD4 T cells, followed by monocytes, then either CD8 T cells or B cells for all samples.
This can be especially useful when multiple tissues, or ages of samples, or conditions exist within the dataset. Here, because our different samples are actually just slices of a single sample, it makes sense that there are no major differences between our groupings.
In order to change how the cells are grouped on the X axis, just change the metadata given to group.by.
#There aren't many alternate metadata options here, so I'm using "orig.ident" which is the same for cells here.
# Some good alternates would be "age", or "condition", or "tissue"
DBBarPlot("Celltype", group.by = "orig.ident", rotate.labels = T)
In order to break down the samples in a different way, just change the first input:
DBBarPlot("BroadCelltype", group.by = "Sample")
You can reorder, relabel, or rotate the x-axis groupings and x.labels:
DBBarPlot("Celltype", group.by = "Sample",
rotate.labels = F,
reorder.x = c(3,2,1), # The method: look at an un-reordered plot, then plug in which position from left-to-right that you want the left-most group to move to, then where the originally second-from-left should go, and so on.
x.labels = c("Pikachu", "Eevee", "Miltank"))
You can adjust what the annotations are called:
DBBarPlot("Celltype", group.by = "Sample",
rename.groups = c("Charizards", "Vaporeons", "Sceptiles", "Zapdos", "Wartortles", "Exeggutors", "Mr. Mimes", "Jynxs"))
We can also adjust the tickmark values of the y-axis, (but not the range which is always 0 to 1):
DBBarPlot("Celltype", group.by = "Sample",
y.breaks = seq(0,1,.25))
We can set a legend title as well as changing all plot titles:
DBBarPlot("Celltype", group.by = "Sample",
rename.groups = c("Charizards", "Vaporeons", "Sceptiles", "Zapdos", "Wartortles", "Exeggutors", "Mr. Mimes", "Jynxs"),
legend.title = "Pokemon",
ylab = "Percent of Wild Encounters",
main = "Encounters by Visit",
sub = "Pallet Town Bushes",
xlab = "Trainers",
x.labels = c("Ash", "Gary", "Prof Oak"))
And finally, we can also set do.hover = T to extract the counts & exact percentages info from the barplot:
DBBarPlot("Celltype", group.by = "Sample",
rename.groups = c("Charizards", "Vaporeons", "Sceptiles", "Zapdos", "Wartortles", "Exeggutors", "Mr. Mimes", "Jynxs"),
legend.title = "Pokemon",
ylab = "Percent of Wild Encounters",
main = "Encounters by Visit",
sub = "Pallet Town Bushes",
xlab = "Trainers",
x.labels = c("Ash", "Gary", "Prof Oak"),
do.hover = T) # Notice that ggplotly ignores our rename.groups change of the identity names. I may fix this bug in a future version.
## Warning: Ignoring unknown aesthetics: text
Now that we have shown that our clusters are spatially distinct in a dimensional reduction plot and that their breakdown accross samples makes sense, we probably want to check that there are actual, meaningful differences between cells of different clusters. Others have made many ways of picking out genes that are differetially expressed accross clusters. A few Seurat functions for doing so are FindMarkers(), FindAllMarkers(), FindConservedMarkers().
Once you have your set of genes, you may be lookinng for a way to visually inspect how expression changes accross your samples or your clusters. You probably also will want to validate your gene lists… I’ve lost count of how many times genes have passed my cutoffs for being differentially expressed but then not actually seemed to truly be differentially expressed.
The DBPlot() and DBDimPlot() function are both great methods for doing this.
Here, for demonstration purposes, we will use the genes the Satija lab had used (and validated) for calling of celltypes, so well… they will indeed be expressed differently accross the identities.
#- Seurat data-on-a-tsne/pca-plot function:
FeaturePlot(object = pbmc,
features.plot = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A",
"FCGR3A", "LYZ", "PPBP", "CD8A"),
cols.use = c("grey", "blue"),
reduction.use = "tsne")
These are a little small for my liking. Also, there is no legend provided, even if we were to just ask for a plot for a single gene.
######- DittoSeq DBDimPlot / multiDBDimPlot function -#####
multiDBDimPlot(c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A",
"FCGR3A", "LYZ", "PPBP", "CD8A"),
object = "pbmc")
We set DEFAULT <- "pbmc" earlier, so the object input can be skipped. The following code would produce the same plot:
multiDBDimPlot(c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A",
"FCGR3A", "LYZ", "PPBP", "CD8A"))
Say you actually want to have your identities displayed as well as reference, you can just add it to the var list. The FeaturePlot function cannot handle metadata, but with DBDimPlot, you can just add “ident”, or any other metadata, to your list.
multiDBDimPlot(c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A",
"FCGR3A", "LYZ", "CD8A", "ident"),
do.letter = F)
Alternatively, if you want to scan through an individual gene / an individual piece of metadata, on a per celltype basis, that’s what the multiDBDimPlot_vary_cells function is for:
#Gene Expression / continuous variable
multiDBDimPlot_vary_cells("CD3E", cells.use.meta = "Celltype",
ncol=4)
#Meta Information / discrete variable
multiDBDimPlot_vary_cells("Celltype", cells.use.meta = "BroadCelltype",
do.letter = F, do.label = T, label.size = 2,
ncol =2)
## Warning: Removed 4 rows containing missing values (geom_label).
## Warning: Removed 4 rows containing missing values (geom_label).
For this _vary_cells function, the All Cells plot and legend can be turned off, and the number of columns and rows can be adjusted with the all.cells.plot, add.single.legend, nrow, and ncol inputs.
multiDBDimPlot_vary_cells("Celltype", cells.use.meta = "BroadCelltype", do.letter = F, do.label = T, label.size = 2,
all.cells.plot = F, add.single.legend = F,
nrow = 1, ncol = 2) #Note: Default for ncol =3, but Default for nrow = as many as needed
## Warning: Removed 4 rows containing missing values (geom_label).
## Warning: Removed 4 rows containing missing values (geom_label).
Before we move on, one last feature that I want to point out now is the do.hover feature. Like for FeaturePlot and DBDimPlot, I have implemented a ggplotly conversion that allows accessing more information about particular cells by just hovering your cursor over them in the plot. To do this turn on hovering with do.hover = T and tell the function what data to pull with data.hover = c("gene1", "gene2", "ident", "metadata1", "metadata2"). This feature does not work for any of my multi-plotters, but ANY expression, clustering, or metadata can be displayed in hover.
Try hovering your mouse over this plot:
DBDimPlot("LYZ", do.hover = T, data.hover = c("LYZ", "ident", "BroadCelltype", "Sample", "CD3E", "CD14"))
## Warning: Ignoring unknown aesthetics: text
More DBDimPlot customization options are possible, and some extras are mentioned towards the end of this document.
Another way of showing expression of genes grouped by an individual celltype, or sample, or any other metadata you might want to group by is the DBPlot() function.
DBPlot() mimics, and vastly expands, Seurat’s VlnPlot() function.
#Seurat's
VlnPlot(pbmc, "MS4A1")
#DittoSeq DBPlot:
DBPlot("MS4A1", group.by = "Celltype", color.by = "Celltype")
And in multi-form:
multiDBPlot(c("MS4A1", "GNLY", "CD3E"),
object = "pbmc",
group.by = "Celltype", color.by = "BroadCelltype")
multiDBPlot(c("CD14", "FCER1A", "FCGR3A"),
object = "pbmc",
group.by = "Celltype", color.by = "BroadCelltype")
multiDBPlot(c("LYZ", "PPBP", "CD8A"),
object = "pbmc",
group.by = "Celltype", color.by = "BroadCelltype")
To change how the data is represented, start with the plots input. You can give this input any combination of c(“jitter”, “vlnplot”, “boxplot”) to tell it which representation to make. And order matters, so:
DBPlot("LYZ", group.by = "Celltype", color.by = "BroadCelltype",
plots = c("jitter","vlnplot")) # This is the default
To put the jitter in front for the violin plot, just switch the order of how these are given. The first item is plotted in the back.
DBPlot("LYZ", group.by = "Celltype", color.by = "BroadCelltype",
plots = c("vlnplot","jitter")) # Changing the order sets the plotting order
To add a boxplot on top…
DBPlot("LYZ", group.by = "Celltype", color.by = "BroadCelltype",
plots = c("vlnplot","jitter","boxplot")) # boxplots can also be added
But these boxplots cover a lot of the data underneath them… let’s make some adjustments:
DBPlot("LYZ", group.by = "Celltype", color.by = "BroadCelltype",
plots = c("vlnplot","jitter","boxplot"),
boxplot.width = 0.3,
boxplot.color = "grey",
boxplot.fill = F,
boxplot.show.outliers = T) # NOTE: Having outliers on alongside a jitter would duplicate those datapoints, so I built in an override. This outlier input is purposefully ignored here.
As you can see, boxplots have lots of customizations. They all start with “boxplot.”
Jitters also have lots of options, all starting with “jitter.”, except for shape.by:
DBPlot("LYZ", group.by = "Celltype", color.by = "BroadCelltype",
plots = c("vlnplot","jitter"),
jitter.color = "gray40",
jitter.size = 1.5,
shape.by = "Sample",
jitter.width = 0.3,
jitter.shape.legend.size = 5)
We can also reorder, relabel, or rotate the x groupings / labels
DBPlot("LYZ", group.by = "Celltype", color.by = "BroadCelltype",
plots = c("vlnplot","jitter"),
rotate.labels = T,
reorder.x = c(3,5,1,2,7,6,8,4), # The method: look at an un-reordered plot, then say which position you want the left-most group to move to, then where the originally second-from-left should go, and so on.
labels = c("CD4 Ts", "CD8 Ts", "Bs", "NKs", "14 Monos", "FC Monos", "DCs", "Megas"))
We can also adjust the range and tickmarks of the y-axis
DBPlot("LYZ", group.by = "Celltype", color.by = "BroadCelltype",
plots = c("jitter","vlnplot"),
y.breaks = c(0,2.5,5,7.5,10))
We can also show the raw counts (or scaled) data instead.
DBPlot("LYZ", group.by = "Celltype", color.by = "BroadCelltype",
plots = c("jitter","vlnplot"),
data.type = "raw")
data.type options:
And finally, we can use do.hover and data.hover with DBPlot in order to select info to display about the cells when we hover the cursor over the jitter dots.
DBPlot("LYZ", group.by = "Celltype", color.by = "BroadCelltype",
plots = c("jitter","vlnplot"), # Notice that ggplotly puts the jitter on top regardless of the order of plots. I may fix this bug in a future version.
data.type = "raw",
do.hover = T, data.hover = c("LYZ","CD8A","Celltype","BroadCelltype","Sample"))
## Warning: Ignoring unknown aesthetics: text
I added a heatmap function to the package more recently. This function is not quite as customizable, and it is essentially just a wrapper for Raivo Kolde’s pheatmap package. But pheatmap is quite fast, which is useful for big single-cell data, and I’ve integrated it with using metadata for annotations:
#install.packages("pheatmap")
library(pheatmap)
DBHeatmap(c("MS4A1","GNLY","CD3E","CD14","FCER1A",
"FCGR3A","LYZ","PPBP","CD8A"))
For real data, you will have more cells than this truncated dataset, so I recommend turning off cell clustering when you are trying out tweaks to the look. Do this by adding cluster_cols=FALSE
DBHeatmap(c("MS4A1","GNLY","CD3E","CD14","FCER1A",
"FCGR3A","LYZ","PPBP","CD8A"),
cells.annotation = "ident",
cluster_cols=FALSE)
Also, as I’m sure you are noticing, cell names are impossible to read here, so I recommend creating a blank meta.data slot, and then setting this to be the cell.names.meta:
pbmc@meta.data$blank <- ""
DBHeatmap(c("MS4A1","GNLY","CD3E","CD14","FCER1A",
"FCGR3A","LYZ","PPBP","CD8A"),
cell.names.meta = "blank")
This cell.names.meta is useful for bulk RNAseq data, or if you want to show only a few cells, and would like to replace the given names with your samples or with a different metadata slot:
DBHeatmap(c("MS4A1","GNLY","CD3E","CD14",
"FCGR3A","LYZ","PPBP","CD8A"),
cells.use = pbmc@cell.names[1:20], #This code directs it to only use the first 20 cells in this dataset
cell.names.meta = "Celltype")
To zoom in on only a certain set of cells/samples, you can use the cells.use option. This works the same as for my other functions. You can either give a list of cell.names to include (which is the same method Seurat functions use), or… Here, we generate a long logical, TRUE/FALSE, the same length as there are cells in our dataset that tells whether a given cell should be included. All cells with the broad celltype of “Lymphoid” are included.
DBHeatmap(c("MS4A1","GNLY","CD3E","CD14","FCER1A",
"FCGR3A","LYZ","PPBP","CD8A"),
cell.names.meta = "blank",
cells.use = meta("BroadCelltype")=="Lymphoid")
We can also add an annotation bar, which is often more useful than cell/sample names, with the cells.annotation input. Give “ident” or any metadata to annotate cells by cluster or metadata:
DBHeatmap(c("MS4A1","GNLY","CD3E","CD14","FCER1A",
"FCGR3A","LYZ","PPBP","CD8A"),
cell.names.meta = "blank",
cells.annotation = "ident",
cells.use = meta("BroadCelltype")=="Lymphoid")
DBHeatmap(c("MS4A1","GNLY","CD3E","CD14","FCER1A",
"FCGR3A","LYZ","PPBP","CD8A"),
cell.names.meta = "blank",
cells.annotation = "Celltype",
cells.use = meta("Sample")=="Sample1")
Additional customization options exist beyond those given in this tutorial. Some are described below, but check the functions’ documentation for more!
# Some "help" code:
??DittoSeq # then click index at the bottom to see a list of all included funcitons
?Simulate # Or substitute with any other function name to get directly to its documentation.
Simulate # Use just a function name without () to see its source code.
For almost all DittoSeq plotting functions, you can use these inputs to modify plot titles:
main = main titlesub = subtitleylab = y axis labelxlab = x axis labelExample:
DBDimPlot("LYZ",
main = "I wanna be the very best",
sub = "like no one ever was",
ylab = "to catch them is my real test",
xlab = "to train them is my cause")
To show only certain cells (/ samples, for bulk data), use the cells.use input. This input takes either a list of names (the method used by Seurat functions), or a logical that contains a TRUE/FALSE inclusion call for every cell. This second option is easier than it sounds, and is often the easier way overall:
Example:
# Option1
DBDimPlot("LYZ",
cells.use = pbmc@cell.names[meta("BroadCelltype")=="Lymphoid"]) # list of cell.names method.)
# Option2
DBDimPlot("LYZ",
cells.use = meta("BroadCelltype")=="Lymphoid") # Logical method
Notice that the logical method input is same as what went into the [] of the cell.names list method. This option is therefore simpler. It also generalizes better to different objects when DEFAULT is changed.
You can add a shape-by variable by setting shape = “metadata”
DBDimPlot("LYZ", shape = "BroadCelltype")
Or you just change all the shapes by setting shape to a number that matches to your shape of choice. For the mappings, see here: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
DBDimPlot("LYZ", shape = 17) #ggplot uses numbers to refer t different shapes.
We can also use plotly and show data about each cell when hovering over them! Quick Note: This feature is not compatible with multi-plots or DBHeatmap.
Turn it on with do.hover = T. For DBBarPlot, that is the only requirement. For other functions, you will also need to provide a list of what data you want to show with data.hover = c("gene1", "gene2", "ident", "metadata1", "metadata2")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
DBDimPlot("LYZ", do.hover = T, data.hover = c("LYZ", "Celltype", "BroadCelltype", "Sample"))
## Warning: Ignoring unknown aesthetics: text
To change the colors of the cell annotations, use the input cell.annotation.colors. This input requires colors be given in list form, so use list(c(“color1”,“color2”,“color3”)).
DBHeatmap(c("MS4A1","GNLY","CD3E","CD14","FCER1A",
"FCGR3A","LYZ","PPBP","CD8A"),
cell.names.meta = "blank",
cells.annotation = "Sample",
cells.annotation.colors = list(c("orange", "yellow", "purple")))
To change the colors in the heatmap, you can modify the heatmap.colors input. Default: heatmap.colors = colorRampPalette(c("blue", "white", "red"))(50). This builds a ramp from blue to white to red with 50 steps. Changing this to heatmap.colors = colorRampPalette(c("green","white","darkblue"))(75) would make the colors go from green to white to darkblue with only 75 steps:
DBHeatmap(c("MS4A1","GNLY","CD3E","CD14","FCER1A",
"FCGR3A","LYZ","PPBP","CD8A"),
cell.names.meta = "blank",
heatmap.colors = colorRampPalette(c("green","white","darkblue"))(75))
Finally, DBHeatmap makes a bunch of manipulations, but then is ultimately a wrapper for the pheatmap function. You can add extra pheatmap inputs directly into your DBHeatmap() input. But if that does not work, you can get the inputs and basic pheatmap script returned with the data.out input. This will spit out the script that would have been used, along with each of the corresponding data. (I’m new to … manipulation, so the script output put together will not actually include any extra inputs you might have given.)
LIST <- DBHeatmap(c("MS4A1","GNLY","CD3E","CD14","FCER1A",
"FCGR3A","LYZ","PPBP","CD8A"),
data.out = TRUE)
LIST$pheatmap.script # This is where to find the script
## [1] "pheatmap(mat = data, color = heatmap.colors, annotation_col = Col_annot, annotation_colors = Col_annot_colors, scale = 'row', labels_col = Col_names,)Ignore , / row#### at end"